Abstract

In this paper, we propose a novel approach to extract primary object
segments in videos in the ‘object proposal’ domain. The extracted primary
object regions are then used to build object models for optimized video
segmentation. The proposed approach has several contributions: First, a
novel layered Directed Acyclic Graph (DAG) based framework is presented for
detection and segmentation of the primary object in video. We exploit the
fact that, in general, objects are spatially cohesive and characterized by
locally smooth motion trajectories, to extract the primary object from the
set of all available proposals based on motion, appearance and
predicted-shape similarity across frames. Second, the DAG is initialized
with an enhanced object proposal set where motion based proposal
predictions (from adjacent frames) are used to expand the set of object
proposals for a particular frame. Last, the paper presents a motion scoring
function for selection of object proposals that emphasizes high optical
flow gradients at proposal boundaries to discriminate between moving
objects and the background. The proposed approach is evaluated using
several challenging benchmark videos and it outperforms both unsupervised
and supervised state-of-the-art methods.


1.  Introduction  &  Related  Work 

In this paper, our goal is to detect the primary object in videos and to
delineate it from the background in all frames. Video object segmentation
is a well-researched problem in the computer vision community and is a
prerequisite for a variety of high-level vision applications, including
content based video retrieval, video summarization, activity understanding
and targeted content replacement. Both fully automatic methods and methods
requiring manual initialization have been proposed for video object
segmentation. In the latter class of approaches, [2, 15, 23] need
annotations of object segments in key frames for initialization.
Figure 1. Primary object region selection in the object proposal domain.
The first row shows frames from a video. The second row shows key object
proposals (in red boundaries) extracted by [13]. “?” indicates that no
proposal related to the primary object was found by the method. The third
row shows primary object proposals selected by the proposed method. Note
that the proposed method was able to find primary object proposals in all
frames. The results in rows 2 and 3 are prior to per-pixel segmentation. In
this paper we demonstrate that temporally dense extraction of primary
object proposals results in significant improvement in object segmentation
performance. Please see Table 1 for quantitative results and comparisons to
the state of the art. [Please Print in Color]


Optimization techniques employing motion and appearance constraints are
then used to propagate the segments to all frames. Other methods ([16, 20])
only require accurate object region annotation for the first frame, then
employ region tracking to segment the rest of the frames into object and
background regions. Note that the aforementioned semi-automatic techniques
generally give good segmentation results. However, most computer vision
applications involve processing of large amounts of video data, which makes
manual initialization cost prohibitive. Consequently, a large number of
automatic methods have also been proposed for video object segmentation. A
subset of these methods employs motion grouping ([19, 18, 4]) for object
segmentation. Other methods ([10, 3, 21]) use appearance cues to segment
each frame first and then use both appearance and motion constraints for a
bottom-up final segmentation. Methods like [9, 3, 11, 22] present efficient
optimization frameworks for spatiotemporal grouping of pixels for video
segmentation. However, none of these automatic methods has an explicit
model of how an object looks or moves, and therefore the segments usually
do not correspond to a particular object but only to image regions that
exhibit coherent appearance or motion.

Figure 2. Object proposals from a video frame employing the method in [7].
The left side image is one of the video frames. Note that the monkey is the
object of interest in the frame. Images on the right show some of the top
ranked object proposals from the frame. Most of the proposals do not
correspond to an actual object. The goal of the proposed work is to
generate an enhanced set of object proposals and extract the segments
related to the primary object from the video.

Figure 3. The Video Object Segmentation Framework

Recently, several methods ([7, 5, 1]) were proposed that provide an
explicit notion of what a generic object looks like. Specifically, the
method [7] can extract object-like regions or ‘object proposals’ from
images. This work was built upon by Lee et al. [13] and Ma and Latecki [14]
to employ object proposals for video object segmentation. Lee et al. [13]
proposed to detect the primary object by collecting a pool of object
proposals from the video, and then applying spectral graph clustering to
obtain multiple binary inlier/outlier partitions. Each inlier cluster
corresponds to a particular object’s regions. Both motion and appearance
based cues are used to measure the ‘objectness’ of a proposal in the
cluster. The cluster with the largest average ‘objectness’ is likely to
contain the primary object in the video. One shortcoming of this approach
is that the clustering process ignores the order of the proposals in the
video, and therefore cannot model the evolution of the object’s shape and
location with time. The work by Ma and Latecki [14] attempts to mitigate
this issue by utilizing relationships between object proposals in adjacent
frames. The object region selection problem is modeled as a constrained
Maximum Weight Cliques problem in order to find the true object region from
all the video frames simultaneously. However, this problem is NP-hard
([14]) and an approximate optimization technique is used to obtain the
solution. The object proposal based segmentation approaches [13, 14] have
two additional limitations compared to the proposed method. First, in both
approaches, object proposal generation for a particular frame does not
directly depend on object proposals generated for adjacent frames. Second,
neither approach actually predicts the shape of the object in adjacent
frames when computing region similarity, which degrades segmentation
performance for fast moving objects.

In this paper, we present an approach that, although inspired by the
aforementioned approaches, attempts to remove their shortcomings. Note
that, in general, an object’s shape and appearance vary slowly from frame
to frame. Therefore, the intuition is that the object proposal sequence in
a video with high ‘objectness’ and high similarity across frames is likely
to be the primary object. To this end, we use optical flow to track the
evolution of object shape, and compute the difference between predicted and
actual shape (along with appearance) to measure similarity of object
proposals across frames. The ‘objectness’ is measured using appearance and
a motion based criterion that emphasizes high optical flow gradients at the
boundaries between object proposals and the background. Moreover, the
primary object proposal selection problem is formulated as the longest path
problem for a Directed Acyclic Graph (DAG), for which (unlike [14]) an
optimal solution can be found in linear time. Note that, if the temporal
order of object proposal locations (across frames) is not used ([13]), then
it can result in no proposals being associated with the primary object for
many frames (please see Figure 1). The proposed method not only uses object
proposals from a particular frame (please see Figure 2), but also expands
the proposal set using predictions from proposals of neighboring frames.
The combination of proposal expansion and the predicted shape based
similarity criteria results in temporally dense and spatially accurate
primary object proposal extraction. We have evaluated the proposed approach
using several challenging benchmark videos and it outperforms both
unsupervised and supervised state-of-the-art methods.

In Section 2, the proposed layered DAG based object selection approach is
introduced and discussed in detail. In Section 3, both qualitative and
quantitative experimental results for two publicly available datasets and
some other challenging videos are shown. The paper is concluded in
Section 4.

2. Layered DAG based Video Object Segmentation

2.1.  The  Framework 

The proposed framework consists of 3 stages (as shown in Figure 3):

1. Generation of object proposals per-frame and then expansion of the
   proposal set for each frame based on object proposals in adjacent
   frames.
2. Generation of a layered DAG from all the object proposals in the video.
   The longest path in the graph fulfills the goal of maximizing objectness
   and similarity scores, and represents the most likely set of proposals
   denoting the primary object in the video.
3. The primary object proposals are used to build object and background
   models using Gaussian mixtures, and a graph-cuts based optimization
   method is used to obtain refined per-pixel segmentation.

Since the proposed approach is centered around the layered DAG framework
for selection of primary object regions, we will start with its
description.

2.2.  Layered  DAG  Structure 

We want to extract object proposals with high objectness likelihood, high
appearance similarity and smoothly varying shape from the set of all
proposals obtained from the video. Also, since we want to extract the
primary object only, we want to extract at most a single proposal per
frame. Keeping these objectives in mind, the layered DAG is formed as
follows. Each object proposal is represented by two nodes: a ‘beginning
node’ and an ‘ending node’, and there are two types of edges: unary edges
and binary edges. The unary edges have weights which measure the objectness
of a proposal. The details of the function for unary weight assignments
(measuring objectness) are given in Section 2.2.1. All the beginning nodes
in the same frame form a layer, as do the ending nodes. A directed unary
edge is built from each beginning node to its ending node. Thus, each video
frame is represented by two layers in the graph. Directed binary edges are
built from any ending node to all the beginning nodes in later layers. The
binary edges have weights which measure the appearance and shape similarity
between the corresponding object proposals across frames. The binary weight
assignment functions are introduced in Section 2.2.2.

Figure 4. Layered Directed Acyclic Graph (DAG) Structure. Nodes “s” and “t”
are source and sink nodes respectively, which have zero weights for edges
with other nodes in the graph. The yellow nodes and the green nodes are
“beginning nodes” and “ending nodes” respectively and they are paired such
that each yellow-green pair represents an object proposal. All the
beginning nodes in the same frame are arranged in a layer, and likewise the
ending nodes. The green edges are the unary edges and the red edges are the
binary edges.

Figure 4 is an illustration of the graph structure. It shows frames $i-1$,
$i$ and $i+1$ of the graph, with corresponding layers $2i-3$, $2i-2$,
$2i-1$, $2i$, $2i+1$ and $2i+2$. Note that only 3 object proposals are
shown for each layer for simplicity; however, there are usually hundreds of
object proposals for each frame and the number of object proposals for
different frames is not necessarily the same. The yellow nodes are
“beginning nodes”, the green nodes are “ending nodes”, the green edges are
unary edges with weights indicating objectness and the red edges are binary
edges with weights indicating appearance and shape similarity (note that
the graph only shows some of the binary edges for simplicity). There is
also a virtual source node $s$ and a sink node $t$ with 0 weighted edges
(black edges) to the graph. Note that it is not necessary to build binary
edges from an ending node to all the beginning nodes in later layers. In
practice, building binary edges only to the next three subsequent frames is
enough for most of the videos.
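
The graph construction described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function name, the tuple-based node naming, and the callables `unary` and `binary` are hypothetical, and connecting the source to every beginning node and every ending node to the sink with zero weight is an assumption consistent with Figure 4.

```python
def build_layered_dag(proposals_per_frame, unary, binary, window=3):
    """Return a dict mapping directed edges (u, v) to weights.

    proposals_per_frame: list with the number of proposals in each frame.
    unary(f, p): objectness of proposal p in frame f (hypothetical callable).
    binary(f, p, g, q): similarity between proposal p of frame f and
        proposal q of a later frame g (hypothetical callable).
    window: number of subsequent frames that receive binary edges.
    """
    edges = {}
    num_frames = len(proposals_per_frame)
    for f, count in enumerate(proposals_per_frame):
        for p in range(count):
            begin, end = ("b", f, p), ("e", f, p)
            edges[("s", begin)] = 0.0          # zero-weight edge from source
            edges[(begin, end)] = unary(f, p)  # unary edge: objectness
            edges[(end, "t")] = 0.0            # zero-weight edge to sink
            last = min(f + window, num_frames - 1)
            for g in range(f + 1, last + 1):   # only the next `window` frames
                for q in range(proposals_per_frame[g]):
                    edges[(end, ("b", g, q))] = binary(f, p, g, q)
    return edges
```

With `window=3` each ending node is linked only to beginning nodes of the next three frames, which keeps the number of binary edges linear in the number of frames for a bounded number of proposals per frame.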

2.2.1  Unary  Edges 

Unary edges measure the objectness of the proposals. Both appearance and
motion are important to infer the objectness, so the scoring function for
object proposals is defined as $S_{unary}(r) = A(r) + M(r)$, in which $r$
is any object proposal, $A(r)$ is the appearance score and $M(r)$ is the
motion score. We define $M(r)$ as the average Frobenius norm of the optical
flow gradient around the boundary of object proposal $r$. The Frobenius
norm of optical flow gradients is defined as:

$$\|U_X\|_F = \left\| \begin{bmatrix} u_x & u_y \\ v_x & v_y \end{bmatrix} \right\|_F = \sqrt{u_x^2 + u_y^2 + v_x^2 + v_y^2}, \tag{1}$$

in which $U = (u, v)$ is the forward optical flow of the frame, and $u_x$,
$v_x$ and $u_y$, $v_y$ are the optical flow gradients in the $x$ and $y$
directions respectively.

Figure 5. Optical Flow Gradient Magnitude Motion Scoring. In row 1, column
1 shows the original video frame, column 2 is one of the object proposals
and column 3 shows the dilated boundary of the object proposal. In row 2,
column 1 shows the forward optical flow of the frame, column 2 shows the
optical flow gradient magnitude map and column 3 shows the optical flow
gradient magnitude response for the specific object proposal around the
boundary. [Please Print in Color]

The intuition behind this motion scoring function is that the motions of
the foreground object and the background are usually distinct, so the
boundary of a moving object usually implies a discontinuity in motion.
Therefore, ideally, the gradient of optical flow should have high magnitude
around the foreground object boundary (this phenomenon can be easily
observed in Figure 5). In Equation 1, we use the Frobenius norm to measure
the optical flow gradient magnitude: the higher the value, the more likely
the region is from a moving object. In practice, the maximum of the optical
flow gradient magnitude usually does not coincide exactly with the moving
object boundary, due to the underlying approximation of the optical flow
calculation. Therefore, we dilate the object proposal boundary and take the
average optical flow gradient magnitude over it as the motion score.
Figure 5 is an illustration of this process. The appearance scoring
function $A(r)$ is measured by the objectness ([7]).
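
As a rough illustration of the motion score, the following Python sketch averages the Frobenius norm of the flow gradient over a dilated boundary band of a proposal mask. It is only a sketch under stated assumptions: the function names are hypothetical, gradients are approximated by finite differences, the boundary is taken as mask pixels with a 4-neighbor outside the mask, and dilation uses a square window of the given `radius`. None of these choices is specified in the text.

```python
import math

def grad(img, y, x):
    """Finite-difference gradient (d/dx, d/dy) of a 2D list at pixel (y, x)."""
    h, w = len(img), len(img[0])
    x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
    y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
    gx = (img[y][x1] - img[y][x0]) / max(x1 - x0, 1)
    gy = (img[y1][x] - img[y0][x]) / max(y1 - y0, 1)
    return gx, gy

def motion_score(u, v, mask, radius=1):
    """Average Frobenius norm of the flow gradient on the dilated boundary.

    u, v: 2D lists with the horizontal and vertical flow components.
    mask: 2D list of 0/1 values marking the object proposal.
    """
    h, w = len(mask), len(mask[0])

    def inside(y, x):
        return 0 <= y < h and 0 <= x < w and mask[y][x]

    # Boundary: proposal pixels with at least one 4-neighbor outside the mask.
    boundary = [(y, x) for y in range(h) for x in range(w)
                if mask[y][x] and not all(
                    inside(y + dy, x + dx)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)))]
    # Dilate the boundary with a (2 * radius + 1) square window.
    band = set()
    for y, x in boundary:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if 0 <= y + dy < h and 0 <= x + dx < w:
                    band.add((y + dy, x + dx))
    if not band:
        return 0.0
    total = 0.0
    for y, x in band:
        ux, uy = grad(u, y, x)
        vx, vy = grad(v, y, x)
        total += math.sqrt(ux * ux + uy * uy + vx * vx + vy * vy)
    return total / len(band)
```

A proposal that moves with the background yields a score of zero, whereas a proposal whose flow differs from its surroundings yields a positive score.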


2.2.2  Binary  Edges 

Binary edges measure the similarity between object proposals across frames.
For measuring the similarity of regions, color, location, size and shape
are the properties to be considered. We define the similarity between
regions as the weight of binary edges as follows:

$$S_{binary}(r_m, r_n) = \lambda \cdot S_{color}(r_m, r_n) \cdot S_{overlap}(r_m, r_n), \tag{2}$$

in which $r_m$ and $r_n$ are regions from frames $m$ and $n$, $\lambda$ is
a constant value for adjusting the ratio between unary and binary edges,
$S_{overlap}$ is the overlap similarity between regions and $S_{color}$ is
the color histogram similarity:

$$S_{color}(r_m, r_n) = \mathrm{hist}(r_m) \cdot \mathrm{hist}(r_n), \tag{3}$$

in which $\mathrm{hist}(r)$ is the normalized color histogram for a
region $r$.


$$S_{overlap}(r_m, r_n) = \frac{|r_m \cap \mathrm{warp}_{mn}(r_n)|}{|r_m \cup \mathrm{warp}_{mn}(r_n)|}, \tag{4}$$

in which $\mathrm{warp}_{mn}(r_n)$ is the region $r_n$ warped by optical
flow to frame $m$. It is clear that $S_{color}$ encodes the color
similarity between regions and $S_{overlap}$ encodes the size and location
similarity between regions. If two regions are close, and their sizes and
shapes are similar, the value is higher, and vice versa. Note that, unlike
prior approaches [13, 14], we use optical flow to predict the region (i.e.
encoding location and shape), and therefore we are better able to compute
similarity for fast moving objects.
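
The binary weight can be sketched in Python as follows. This is a minimal illustration, not the authors' code: regions are represented as sets of `(y, x)` pixels, the flow is a hypothetical dictionary from pixel to `(dx, dy)`, warping rounds the displacement to the nearest pixel, and the overlap similarity is assumed to be the intersection-over-union of the region and the warped region.

```python
def color_similarity(hist_m, hist_n):
    """Dot product of two normalized color histograms."""
    return sum(a * b for a, b in zip(hist_m, hist_n))

def warp_region(region, flow):
    """Move each pixel (y, x) of a region by its rounded flow vector (dx, dy)."""
    warped = set()
    for (y, x) in region:
        dx, dy = flow.get((y, x), (0.0, 0.0))
        warped.add((y + round(dy), x + round(dx)))
    return warped

def overlap_similarity(region_m, warped_n):
    """Intersection over union of two pixel sets (assumed overlap measure)."""
    union = region_m | warped_n
    return len(region_m & warped_n) / len(union) if union else 0.0

def binary_weight(region_m, hist_m, region_n, hist_n, flow_n_to_m, lam=1.0):
    """lam * S_color * S_overlap for regions from frames m and n."""
    warped = warp_region(region_n, flow_n_to_m)
    return (lam * color_similarity(hist_m, hist_n)
            * overlap_similarity(region_m, warped))
```

Because the overlap is computed after warping, a fast moving object that shifts by several pixels between frames can still obtain a high similarity, which would not be the case if the raw regions were compared.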


2.2.3  Dynamic  Programming  Solution 

So far, we have built the layered DAG and the objective is clear: to find
the highest weighted path in the DAG. Assume the graph contains $2F + 2$
layers ($F$ is the number of frames), the source node is in layer 0 and the
sink node is in layer $2F + 2$. Let $N_{ij}$ denote the $j$th node in the
$i$th layer and $E(N_{ij}, N_{kl})$ denote the edge from $N_{ij}$ to
$N_{kl}$. Layer $i$ has $M_i$ nodes. Let
$P = (p_1, p_2, \ldots, p_{m+1}) = (N_{01}, N_{j_1 j_2}, \ldots, N_{(2F+2)1})$
be a path from the source to the sink node. Therefore,

$$P_{max} = \arg\max_{P} \sum_{i=1}^{m} E(p_i, p_{i+1}). \tag{5}$$

Finding $P_{max}$ is a Longest (simple) Path Problem for a DAG. Let
$OPT(i, j)$ be the maximum path value for $N_{ij}$ from the source node.
The maximum path value satisfies the following recurrence for $i \geq 1$
and $j \geq 1$:

$$OPT(i, j) = \max_{k = 0 \ldots i-1,\; l = 1 \ldots M_k} \left[ OPT(k, l) + E(N_{kl}, N_{ij}) \right]. \tag{6}$$
This problem can be solved by dynamic programming in linear time [12]. The
computational complexity of the algorithm is $O(n + m)$, in which $n$ is
the number of nodes and $m$ is the number of edges. The most important
parameter for the layered DAG is the ratio $\lambda$ between unary edges
and binary edges. However, in practice, the results are not sensitive to
it, and in the experiments $\lambda$ is simply set to 1.
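
The recurrence can be made concrete with a small Python sketch of the longest path computation on a layered DAG. The function name and the data layout (a list of layers and an edge dictionary) are illustrative assumptions; the sketch also assumes that the first and last layers contain only the source and the sink, and that the sink is reachable.

```python
def longest_path(layers, edges):
    """Longest source-to-sink path in a layered DAG by dynamic programming.

    layers: list of lists of node ids in layer order; the first layer holds
        only the source and the last layer only the sink.
    edges: dict mapping (u, v) to a weight, with u in an earlier layer than v.
    Returns (best_value, path).
    """
    incoming = {}
    for (u, v), w in edges.items():
        incoming.setdefault(v, []).append((u, w))
    source, sink = layers[0][0], layers[-1][0]
    opt = {source: 0.0}   # opt[v]: best path value from the source to v
    parent = {}           # parent[v]: predecessor of v on that best path
    for layer in layers[1:]:
        for v in layer:
            best = None
            for u, w in incoming.get(v, []):
                if u in opt and (best is None or opt[u] + w > best):
                    best = opt[u] + w
                    parent[v] = u
            if best is not None:
                opt[v] = best
    # Backtrack from the sink to recover the selected proposals.
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return opt[sink], path
```

Each node and each edge is visited a constant number of times, so the running time is proportional to the number of nodes plus the number of edges.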

2.3.  Per-pixel  Video  Object  Segmentation 

Once the primary object proposals are obtained in a video, the results are
further refined by a graph-based method to get per-pixel segmentation
results. We define a spatiotemporal graph by connecting frames temporally
with optical flow displacement. Each of the nodes in the graph is a pixel
in a frame, and edges are set to be the 8-neighbors within one frame and
the forward-backward 18 neighbors in adjacent frames. We define the energy
function for labeling $f = [f_1, f_2, \ldots, f_n]$ of $n$ pixels with
prior knowledge of $h$:

$$E(f, h) = \sum_{i \in S} D_i^h(f_i) + \lambda \sum_{(i,j) \in N} V_{i,j}(f_i, f_j), \tag{7}$$

where $S = \{p_1, \ldots, p_n\}$ is the set of $n$ pixels in the video, $N$
consists of neighboring pixels, and $i$, $j$ index the pixels. The label
$f_i$ of pixel $p_i$ is set to 0 or 1, which represents background or
foreground respectively. The unary term $D_i^h$ defines the cost of
labeling pixel $i$ with label $f_i$, which we get from the Gaussian Mixture
Models (GMM) for both color and location:

$$D_i^h(f_i) = -\log\left( \alpha U_i^c(f_i, h) + (1 - \alpha) U_i^l(f_i, h) \right), \tag{8}$$

where $U_i^c(\cdot)$ is the color-induced cost and $U_i^l(\cdot)$ is the
location cost.

For the binary term $V_{i,j}$ we follow the definitions in [17]:

$$V_{i,j}(f_i, f_j) = [f_i \neq f_j] \exp\left( -\beta (C_i - C_j)^2 \right), \tag{9}$$

where $[\cdot]$ denotes the indicator function taking values 0 and 1,
$(C_i - C_j)^2$ is the squared Euclidean distance between two adjacent
nodes in RGB space, and $\beta$ is a constant whose definition also follows
[17].

We use the graph-cuts based minimization method in [8] to obtain the
optimal solution for Equation 7, and thus get the final segmentation
results. Next, we describe the method for object proposal generation that
is used to initialize the video object segmentation process.
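
To make the energy concrete, the following Python sketch evaluates it for a given labeling on a small graph. It does not perform the graph-cuts minimization of [8]; it only computes the value that such a minimization would compare. The function names, the data layout, and the default value of `alpha` are assumptions for illustration, and the color and location terms are treated as GMM-derived values in (0, 1].

```python
import math

def unary_cost(u_color, u_location, alpha=0.5):
    """-log(alpha * U_color + (1 - alpha) * U_location) for one pixel/label."""
    return -math.log(alpha * u_color + (1 - alpha) * u_location)

def pairwise_cost(f_i, f_j, c_i, c_j, beta):
    """[f_i != f_j] * exp(-beta * squared RGB distance)."""
    if f_i == f_j:
        return 0.0
    dist2 = sum((a - b) ** 2 for a, b in zip(c_i, c_j))
    return math.exp(-beta * dist2)

def energy(labels, unary, colors, neighbors, beta, lam=1.0):
    """Total energy of a binary labeling.

    labels: list of 0/1 labels, one per pixel.
    unary: unary[i][label] is the data cost of giving pixel i that label.
    colors: RGB tuple per pixel.
    neighbors: list of (i, j) index pairs of neighboring pixels.
    """
    data = sum(unary[i][f] for i, f in enumerate(labels))
    smooth = sum(pairwise_cost(labels[i], labels[j], colors[i], colors[j], beta)
                 for i, j in neighbors)
    return data + lam * smooth
```

The pairwise term penalizes a label change between similar-colored neighbors much more than between dissimilar ones, so low-energy labelings place the object boundary along color edges.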

2.4.  Object  Proposal  Generation  &  Expansion 

In order to achieve our goal of identifying image regions belonging to the
primary object in the video, it is preferable (though not necessary) to
have an object proposal corresponding to the actual object for each frame
in which the object is present. Using only appearance or optical flow based
cues to generate object proposals is usually not enough for this purpose.
This phenomenon can be observed in the example shown in Figure 6. For frame
$i$ in this figure, hundreds of object proposals were generated using the
method in [7]; however, no proposal is consistent with the true object, and
the object is fragmented between different proposals.

Figure 6. Object Proposal Expansion. For each optical flow warped object
proposal in frame $i-1$, we look for object proposals in frame $i$ which
have high overlap ratios with the warped one. If several object proposals
all have high overlap ratios with the warped one, they are merged into a
new large object proposal. This process will produce the right object
proposal if it is discovered by [7] in frame $i-1$ but not in frame $i$.
Panels: object proposal (frame $i-1$); optical flow; predicted proposal for
frame $i$ (warped); video frame (frame $i$); selected regions; expanded
(additional) object proposal for frame $i$.

We assume that an object’s shape and location change smoothly across frames
and propose to enhance the set of object proposals for a frame by using the
proposals generated for its adjacent frames. The object proposal expansion
method is guided by optical flow (see Figure 6). For the forward version of
object proposal expansion, each object proposal $r_{i-1}^k$ in frame $i-1$
is warped by the forward optical flow to frame $i$, then a check is made
whether any proposal $r_i^j$ in frame $i$ has a large overlap ratio with
the warped object proposal, i.e.,

$$o = \frac{|\mathrm{warp}_{i-1,i}(r_{i-1}^k) \cap r_i^j|}{|r_i^j|}. \tag{10}$$

The contiguous overlapped areas, for regions in frame $i$ with $o$ greater
than 0.5, are merged into a single region, and are used as additional
proposals. Note that the original proposals are also kept, so this is an
‘expansion’ of the proposal set, and not a replacement. In practice, this
process is carried out both forward and backward in time. Since it is an
iterative process, even if suitable object proposals are missing in
consecutive frames, they could potentially be produced by this expansion
process. Figure 6 shows an example image sequence where the expansion
process resulted in generation of a suitable proposal.
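
One forward expansion step can be sketched in Python as follows. This is a simplified illustration rather than the authors' procedure: regions are sets of `(y, x)` pixels, the warped proposal is assumed to be given, the overlap ratio is assumed to be normalized by the area of the proposal in frame $i$, and the contiguity check on the merged area is omitted. The function names are hypothetical.

```python
def overlap_ratio(warped, proposal):
    """Fraction of `proposal` that is covered by the warped proposal."""
    return len(warped & proposal) / len(proposal) if proposal else 0.0

def expand_proposals(warped, proposals, threshold=0.5):
    """Return the original proposals plus, possibly, one merged proposal.

    warped: pixel set of a proposal from frame i-1 warped to frame i.
    proposals: list of pixel sets generated for frame i.
    """
    selected = [p for p in proposals if overlap_ratio(warped, p) > threshold]
    expanded = list(proposals)           # the original proposals are kept
    if len(selected) > 1:
        merged = set().union(*selected)  # merge the fragments into one region
        if merged not in expanded:
            expanded.append(merged)
    return expanded
```

In this sketch two fragments of an object that are both covered by the warped proposal are merged into one additional proposal, while an unrelated proposal elsewhere in the frame is left untouched.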

3.  Experiments 

The proposed method was evaluated using two well-known segmentation
datasets: the SegTrack dataset [20] and the GaTech video segmentation
dataset [9]. Quantitative comparisons are shown for the SegTrack dataset
since ground-truth is available for this dataset. Qualitative results are
shown for the GaTech video segmentation dataset. We also evaluated the
proposed approach on additional challenging videos, for which we will share
the ground-truth to aid future evaluations.




3.1.  SegTrack  Dataset 

We first evaluate our method on the SegTrack dataset [20]. There are 6
videos in this dataset, and a pixel-level segmentation ground-truth is
available for each video. We follow the setup in the literature
([13, 14]), and use 5 of the videos (birdfall, cheetah, girl, monkeydog and
parachute) for evaluation, since the ground-truth for the remaining one
(penguin) is not usable. We use an optical flow magnitude based model
selection method to infer the camera motion: for static cameras, a
background subtraction cue is also used for moving object extraction. For
all the results shown in this section, the static camera model was selected
(automatically) only for the “birdfall” video.

We compare our method with 4 state-of-the-art methods, [14], [13], [20] and
[6], as shown in Table 1. Note that our method is an unsupervised method,
and it outperforms all the other unsupervised methods except for the
parachute video, where it is a close second. Note that [20] and [6] are
supervised methods which need an initial annotation for the first frame.
The results in Table 1 are the average per-frame pixel error rate compared
to the ground-truth. The definition is [20]:

$$error = \frac{XOR(f, GT)}{F}, \tag{11}$$

where $f$ is the segmentation labeling result of the method, $GT$ is the
ground-truth labeling of the video, and $F$ is the number of frames in the
video. Figure 7 shows qualitative results for the videos of the SegTrack
dataset.

Figure 7. SegTrack dataset results. The regions within the red boundaries
are the segmented primary objects. [Please Print in Color]

Video         Ours    [14]        [13]        [20]    [6]
birdfall       155     189         288         252     454
cheetah        633     (missing)   (missing)   1142    1217
girl          1488    1698        1785        1304    1755
monkeydog      365     472         521         563     683
parachute      220     221         201         235     502
Avg.           452     542         592         594     791
supervised?    N       N           N           Y       Y

Table 1. Quantitative results and comparison with the state of the art on
the SegTrack dataset
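
The error measure of Equation 11 can be sketched in a few lines of Python. The function name and the representation of each frame as a 2D list of 0/1 labels are assumptions made for illustration.

```python
def average_pixel_error(segmentations, ground_truths):
    """Average number of mislabeled pixels per frame (XOR count / frames).

    segmentations, ground_truths: lists of frames; each frame is a 2D list
    of 0/1 labels with identical shape in both lists.
    """
    mismatches = 0
    for seg, gt in zip(segmentations, ground_truths):
        for seg_row, gt_row in zip(seg, gt):
            # A pixel contributes to the XOR when the two labels differ.
            mismatches += sum(a != b for a, b in zip(seg_row, gt_row))
    return mismatches / len(segmentations)
```

Since the measure counts pixels rather than a ratio, its scale depends on the frame resolution and the object size, so values are comparable across methods on the same video rather than across videos.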

Figure 8 is an example that shows the effectiveness of the proposed layered
DAG approach for temporally dense extraction of primary object regions. The
figure shows consecutive frames (frame 38 to frame 43) from the “monkeydog”
video. The top 2 rows show the results of the key-frame object extraction
method [13], and the bottom 2 rows show our object region selection
results. As one can see, [13] detects the primary object proposal in only
one of the frames; however, by using the proposed approach, we can extract
the primary object region from all the frames. This is the main reason that
the segmentation results of the proposed method are better than those of
prior methods.

Figure 8. Comparison of object region selection methods. (a) Key-frame
Object Region Selection. (b) Layered DAG Object Region Selection. The
regions within the red boundaries are the selected object regions. “?”
means there is no object region selected by the method. Numbers above the
images are the frame indices (#38 to #43). [Please Print in Color]

3.2.  GaTech  Segmentation  Dataset 

We also evaluated the proposed method on the GaTech video segmentation
dataset. We show a qualitative comparison of results between the proposed
approach and the original bottom-up method for the dataset in Figure 9. As
one can observe, our method segments the true foreground object from the
background. The method [9] does not use an object model, which induces
over-segmentation (although its results are very good for the general
segmentation problem).


Figure 9. Object Segmentation Results on GaTech Video Segmentation Dataset.
Panels: (a) waterski, (b) yunakim. Row 1: original frame. Row 2:
segmentation results by the bottom-up segmentation method [9]. Row 3: video
object segmentation by the proposed method. The regions within the red or
green boundaries are the segmented primary objects. [Please Print in Color]


3.3. Persons and Cars Segmentation Dataset

We have built a new dataset for video object segmentation. The dataset is
challenging: persons are in a variety of poses; cars have different speeds,
and when they are slow, it is very hard to do motion segmentation. We
generated ground truth for those videos. Figure 10 shows some sample
results from this dataset, and Table 2 shows the quantitative results for
this dataset (the average per-frame pixel error is defined in the same way
as for the SegTrack dataset [20]). Please go to http://crcv.ucf.edu for
more details.

Video        Average per-frame pixel error
Surfing      1209
Jumping       835
Skiing        817
Sliding      2228
Big car      1129
Small car     272

Table 2. Quantitative Results on Persons and Cars dataset

4. Conclusions

We have proposed a novel and efficient layered DAG based approach to
segment the primary object in videos. This approach also uses innovative
mechanisms to compute the ‘objectness’ of a region and to compute
similarity between object proposals across frames. The proposed approach
outperforms the state of the art on the well-known SegTrack dataset. We
also demonstrate good segmentation performance on additional challenging
data sets.
Figure 10. Sample Results on Persons and Cars Dataset. Panels: (a) Surfing,
(b) Jumping, (c) Skiing, (d) Sliding, (e) Big car, (f) Small car. Please go
to http://crcv.ucf.edu for more details.